Char2Wav: End-to-End Speech Synthesis
Authors
Abstract
We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoder-decoder model with attention. The encoder is a bidirectional recurrent neural network (RNN) that accepts text or phonemes as input, while the decoder is an RNN with attention that produces vocoder acoustic features. The neural vocoder is a conditional extension of SampleRNN that generates raw waveform samples from intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.
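As a rough illustration of what "a decoder with attention" does at a single step, the sketch below computes attention weights over a set of encoder states and the resulting context vector. It is not the authors' implementation: the abstract does not say which attention mechanism the reader uses, so plain dot-product attention is assumed here, and the names `softmax`, `attend`, `encoder_states`, and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """One attention step (assumed dot-product scoring, not from the paper).

    query:          decoder state for the current output frame
    encoder_states: one vector per input character or phoneme
    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy values standing in for bidirectional-RNN outputs over three input symbols.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

weights, context = attend(query, encoder_states)
print(weights)
print(context)
```

In a full system along the lines the abstract describes, the decoder would combine the context vector with its own state to emit one frame of vocoder acoustic features, and a separate neural vocoder would turn those frames into waveform samples; neither stage is modeled here.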
Similar resources
JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis
Thanks to improvements in machine learning techniques including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies has an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we designed a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end...
ObamaNet: Photo-realistic lip-sync from text
We present ObamaNet, the first architecture that takes any text as input and generates both the corresponding speech and synchronized photo-realistic lip-sync videos. Contrary to other published lip-sync approaches, ours is only composed of fully trainable neural modules and does not rely on any traditional computer graphics methods. More precisely, we use three main modules: a text-to-speech n...
End-to-End Neural Speech Synthesis
In recent years, end-to-end neural networks have become the state of the art for speech recognition tasks and they are now widely deployed in industry (Amodei et al., 2016). Naturally, this has led to the creation of systems to do the opposite – end-to-end speech synthesis from raw text. Very recently, neural TTS systems have become highly competitive with their conventional counterparts, showi...
An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis
We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard me...
Two-way speech-to-speech translation on handheld devices
This paper presents a two-way speech translation system that is completely hosted on an off-the-shelf handheld device. Specifically, this end-to-end system includes an HMM-based large vocabulary continuous speech recognizer (LVCSR) for both English and Chinese using statistical n-grams, a two-way translation system between English and Chinese, and a multilingual speech synthesis system that out...